## [1] 4898 12
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
The alcohol content ranges from 8% to 14.2%. About 75% of the wines have a residual sugar value below 10 grams/liter (wines over 45 grams/liter are considered sweet). Some wines contain no citric acid. The mean quality is 5.878, and the maximum and minimum quality are 9 and 3 respectively.
The quality distribution appears unimodal and roughly normal, centered at 6.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Most alcohol values are around 9.5%, and the distribution is right-skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
Most sulphates values are around 0.45, and the distribution is right-skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
pH follows an approximately normal distribution with a mean near 3.2.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Density follows an approximately normal distribution with a mean of 0.994 and some outliers over 1.01.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
Total sulfur dioxide follows an approximately normal distribution with a mean of 138.4 and some outliers over 275. The middle 50% of the data lies between 108 and 167.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
Free sulfur dioxide follows an approximately normal distribution with a mean near 35 and some outliers over 90. The middle 50% of the data lies between 23 and 46.
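The outlier thresholds quoted above (275 and 90) appear to have been read off the plots. As a rough cross-check, the sketch below derives thresholds from the reported quartiles using Tukey's 1.5 × IQR rule. That rule is an assumption introduced here for illustration, not the cut-off used in the report, and the `quartiles` data frame is a hypothetical helper.

```r
# Quartiles (Q1, Q3) copied from the summary() output above.
quartiles <- data.frame(
  feature = c("total.sulfur.dioxide", "free.sulfur.dioxide"),
  q1 = c(108, 23),
  q3 = c(167, 46)
)

# Interquartile range: the width of the middle 50% of the data.
quartiles$iqr <- quartiles$q3 - quartiles$q1

# Tukey's rule (an assumption, not the report's own threshold):
# values more than 1.5 * IQR beyond a quartile are flagged as potential outliers.
quartiles$lower_fence <- quartiles$q1 - 1.5 * quartiles$iqr
quartiles$upper_fence <- quartiles$q3 + 1.5 * quartiles$iqr

print(quartiles)
```

Under this rule the upper fences come out somewhat lower than the values read from the plots, so the two approaches would not flag exactly the same wines.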
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
About 50% of the chlorides values lie between 0.036 and 0.05 (the first and third quartiles).
The long-tailed residual sugar data was transformed (log10) to better understand its distribution. The transformed residual sugar distribution appears bimodal, with the residual sugar peaking around 1.5 and again at around 10. This is one interesting plot.
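To see why a log10 scale helps with a long right tail, the following sketch applies the transform to the five-number summary of residual sugar reported earlier. It only uses the summary values, not the full data, so it illustrates the effect on the tail rather than the bimodal shape itself; `raw_ratio` and `log_ratio` are names made up for this example.

```r
# Five-number summary of residual.sugar (grams/liter), copied from summary().
sugar <- c(min = 0.6, q1 = 1.7, median = 5.2, q3 = 9.9, max = 65.8)
log_sugar <- log10(sugar)
print(round(log_sugar, 3))

# Length of the upper tail (max - Q3) relative to the interquartile range,
# on the raw scale and on the log10 scale.
raw_ratio <- (sugar[["max"]] - sugar[["q3"]]) / (sugar[["q3"]] - sugar[["q1"]])
log_ratio <- (log_sugar[["max"]] - log_sugar[["q3"]]) /
  (log_sugar[["q3"]] - log_sugar[["q1"]])

print(c(raw = raw_ratio, log10 = log_ratio))
```

On the raw scale the upper tail is several times wider than the interquartile range, while on the log10 scale the two are of similar size, which leaves room for the bulk of the distribution to be seen.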
Given the context of this feature, it is a good candidate to be turned into a new categorical (factor) variable. The description says that it is rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet, so our new variable could be:
## RARE NORMAL SWEET
## 77 4820 1
Sadly, there aren’t enough cases in SWEET and RARE to make this feature useful.
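One way such a categorization could be built is with `cut()`. This is a minimal sketch, not necessarily the code used for the table above: the sample vector is the first ten `residual.sugar` values from `str()` plus the reported minimum and maximum, and the handling of values exactly equal to 1 or 45 is an assumption.

```r
# First ten residual.sugar values from str(), plus the reported min and max.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5, 0.6, 65.8)

# right = FALSE gives the intervals [-Inf, 1), [1, 45), [45, Inf).
# Assumption: a wine with exactly 45 grams/liter would land in SWEET here.
sugar_level <- cut(
  residual_sugar,
  breaks = c(-Inf, 1, 45, Inf),
  labels = c("RARE", "NORMAL", "SWEET"),
  right = FALSE
)

print(table(sugar_level))
```

Even in this small sample almost everything falls in NORMAL, which mirrors the imbalance seen in the full table.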
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Citric acid follows an approximately normal distribution with a mean near 0.33 and some outliers over 0.9. The middle 50% of the data lies between 0.27 and 0.39.
Setting the binwidth to 0.01 we can see an abnormally large number of values around 0.5 (0.49 exactly).
##
## 0.3 0.28 0.32 0.34 0.29 0.26 0.27 0.49 0.31 0.33 0.24 0.36 0.35 0.25 0.37
## 307 282 257 225 223 219 216 215 200 183 181 177 137 136 134
## 0.38 0.4 0.22 0.39 0.42 0.23 0.41 0.2 0.21 0.44 0.46 0.18 0.19 0.45 0.74
## 122 117 104 101 95 83 82 70 66 63 51 49 48 46 41
## 0.48 0.47 0.43 0.5 0.16 0.14 0.17 0.51 0.15 0.52 0.56 0.58 0 0.12 0.54
## 39 38 37 35 33 27 27 25 23 23 22 21 19 19 19
## 0.13 0.53 0.1 0.62 0.57 0.04 0.07 0.09 0.55 0.61 0.71 0.65 0.01 0.66 0.67
## 17 16 14 14 13 12 12 12 11 9 9 8 7 7 7
## 0.68 0.02 0.06 0.59 0.6 0.64 0.05 0.69 0.72 0.73 1 0.08 0.63 0.7 0.03
## 7 6 6 6 6 6 5 5 5 5 5 4 4 3 2
## 0.78 0.79 0.8 0.81 0.82 0.91 0.11 0.86 0.88 0.99 1.23 1.66
## 2 2 2 2 2 2 1 1 1 1 1 1
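A frequency table sorted by count, like the one above, can be produced with `sort(table(...))`. The sketch below assumes that is roughly how the table was obtained and uses only the first ten `citric.acid` values shown by `str()`, so the counts are not those of the full data.

```r
# First ten citric.acid values from the str() output.
citric_acid <- c(0.36, 0.34, 0.40, 0.32, 0.32, 0.40, 0.16, 0.36, 0.34, 0.43)

# Count how often each distinct value occurs, most frequent first.
counts <- sort(table(citric_acid), decreasing = TRUE)
print(counts)

# Values tied for the highest count in this small sample.
most_common <- names(counts)[counts == max(counts)]
print(most_common)
```

On the full data the same call would surface the spike at 0.49 near the top of the table.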
The long-tailed volatile acidity data was transformed to better understand its distribution. The transformed volatile acidity distribution appears normal, with the acidity peaking around 0.25.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
Fixed acidity follows an approximately normal distribution with a mean near 6.85 and some outliers over 10 and under 4. The middle 50% of the data lies between 6.3 and 7.3.
A new feature could be obtained from the acidities. Some references say that total acidity is the fixed acidity plus the volatile acidity. But the fixed acidity measure should be understood (for easier interpretation) as just tartaric acid and not all the non-volatile acids, so our total acidity is going to be the sum of all the acids in the data set: fixed acidity, volatile acidity and citric acid.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.130 6.890 7.405 7.467 7.960 14.960
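A minimal sketch of how the new feature could be computed, assuming `total.acidity` is the sum of the three acid columns. That assumption is inferred from the summaries: the three reported means add up to the reported mean of the new feature. The `wines_head` data frame holds only the first five rows shown by `str()`.

```r
# First five rows of the three acid columns, copied from the str() output.
wines_head <- data.frame(
  fixed.acidity    = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity = c(0.27, 0.30, 0.28, 0.23, 0.23),
  citric.acid      = c(0.36, 0.34, 0.40, 0.32, 0.32)
)

# Assumed definition: total acidity is the sum of all three acids.
wines_head$total.acidity <- with(
  wines_head,
  fixed.acidity + volatile.acidity + citric.acid
)
print(wines_head)

# Consistency check: the sum of the reported column means should be close
# to the reported mean of total.acidity (7.467).
mean_sum <- 6.855 + 0.2782 + 0.3342
print(mean_sum)
```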
There are 4898 wines in the dataset with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and quality).
All the features are numerical, even quality, which is a score on a scale of 0 to 10. This feature is the easiest one to convert to a factor for easier plot interpretation, but we are going to maintain both versions.
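Keeping both a numeric and a factor version of quality could look like the sketch below. The quality vector is rebuilt from the per-level counts printed later in this report, so the row order is not the original one, and the names `quality_counts` and `quality_factor` are made up for this example.

```r
# Number of wines per quality score, taken from the counts in this report.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Numeric version: one entry per wine (order is not the original row order).
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Factor version: ordered, so that 3 < 4 < ... < 9 is preserved in plots.
quality_factor <- factor(quality, levels = 3:9, ordered = TRUE)

print(table(quality_factor))
print(mean(quality))    # the numeric version still supports means
print(median(quality))
```

The numeric column is what the summaries and the linear models use, while the ordered factor is convenient for grouping and for boxplots by quality.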
Main thoughts:
Judging by the univariate plots, most of the features follow approximately normal distributions with little variability but some outliers.
The main feature in the data set is quality. The main idea is to try to predict the quality of a wine. To do this, let’s see how quality behaves against some of the other features.
I’ve learned, reviewing some internet articles, that a good wine quality is given by this balance: Sweet Taste (sugars + alcohols) <=> Acid Taste (acids) + Bitter Taste (phenols). In the case of white wines, the concentration of phenols (tannins, which give red wine its color) is insignificant. So the interesting features for this analysis will be: fixed.acidity, volatile.acidity, citric.acid, total.acidity, residual.sugar, alcohol and quality.
Total acidity is a combination of all the acids. Residual sugar has also been categorized, but the small number of observations in some categories made that variable useless.
Residual sugar has perhaps the most unusual distribution: after applying a log10 transform to better understand the data, a bimodal distribution appears. The rest of the features seem to be approximately normal, some of them right-skewed.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.02269729 0.289180698
## volatile.acidity -0.02269729 1.00000000 -0.149471811
## citric.acid 0.28918070 -0.14947181 1.000000000
## residual.sugar 0.08902070 0.06428606 0.094211624
## chlorides 0.02308564 0.07051157 0.114364448
## free.sulfur.dioxide -0.04939586 -0.09701194 0.094077221
## total.sulfur.dioxide 0.09106976 0.08926050 0.121130798
## density 0.26533101 0.02711385 0.149502571
## pH -0.42585829 -0.03191537 -0.163748211
## sulphates -0.01714299 -0.03572815 0.062330940
## alcohol -0.12088112 0.06771794 -0.075728730
## quality -0.11366283 -0.19472297 -0.009209091
## total.acidity 0.98717874 0.07157062 0.394143356
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.08902070 0.02308564 -0.0493958591
## volatile.acidity 0.06428606 0.07051157 -0.0970119393
## citric.acid 0.09421162 0.11436445 0.0940772210
## residual.sugar 1.00000000 0.08868454 0.2990983537
## chlorides 0.08868454 1.00000000 0.1013923521
## free.sulfur.dioxide 0.29909835 0.10139235 1.0000000000
## total.sulfur.dioxide 0.40143931 0.19891030 0.6155009650
## density 0.83896645 0.25721132 0.2942104109
## pH -0.19413345 -0.09043946 -0.0006177961
## sulphates -0.02666437 0.01676288 0.0592172458
## alcohol -0.45063122 -0.36018871 -0.2501039415
## quality -0.09757683 -0.20993441 0.0081580671
## total.acidity 0.10473749 0.04552987 -0.0451333172
## total.sulfur.dioxide density pH
## fixed.acidity 0.091069756 0.26533101 -0.4258582910
## volatile.acidity 0.089260504 0.02711385 -0.0319153683
## citric.acid 0.121130798 0.14950257 -0.1637482114
## residual.sugar 0.401439311 0.83896645 -0.1941334540
## chlorides 0.198910300 0.25721132 -0.0904394560
## free.sulfur.dioxide 0.615500965 0.29421041 -0.0006177961
## total.sulfur.dioxide 1.000000000 0.52988132 0.0023209718
## density 0.529881324 1.00000000 -0.0935914935
## pH 0.002320972 -0.09359149 1.0000000000
## sulphates 0.134562367 0.07449315 0.1559514973
## alcohol -0.448892102 -0.78013762 0.1214320987
## quality -0.174737218 -0.30712331 0.0994272457
## total.acidity 0.113188502 0.27560881 -0.4306513315
## sulphates alcohol quality total.acidity
## fixed.acidity -0.01714299 -0.12088112 -0.113662831 0.98717874
## volatile.acidity -0.03572815 0.06771794 -0.194722969 0.07157062
## citric.acid 0.06233094 -0.07572873 -0.009209091 0.39414336
## residual.sugar -0.02666437 -0.45063122 -0.097576829 0.10473749
## chlorides 0.01676288 -0.36018871 -0.209934411 0.04552987
## free.sulfur.dioxide 0.05921725 -0.25010394 0.008158067 -0.04513332
## total.sulfur.dioxide 0.13456237 -0.44889210 -0.174737218 0.11318850
## density 0.07449315 -0.78013762 -0.307123313 0.27560881
## pH 0.15595150 0.12143210 0.099427246 -0.43065133
## sulphates 1.00000000 -0.01743277 0.053677877 -0.01185225
## alcohol -0.01743277 1.00000000 0.435574715 -0.11751272
## quality 0.05367788 0.43557472 1.000000000 -0.13137721
## total.acidity -0.01185225 -0.11751272 -0.131377207 1.00000000
Some unexpected correlations appear among the features. Density seems to be correlated with residual sugar and with alcohol. So let’s include these in the features under investigation.
## residual.sugar density alcohol quality
## residual.sugar 1.00000000 0.8389665 -0.4506312 -0.09757683
## density 0.83896645 1.0000000 -0.7801376 -0.30712331
## alcohol -0.45063122 -0.7801376 1.0000000 0.43557472
## quality -0.09757683 -0.3071233 0.4355747 1.00000000
## total.acidity 0.10473749 0.2756088 -0.1175127 -0.13137721
## total.acidity
## residual.sugar 0.1047375
## density 0.2756088
## alcohol -0.1175127
## quality -0.1313772
## total.acidity 1.0000000
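A reduced correlation matrix like the one above can be obtained by selecting columns before calling `cor()`. The sketch below uses only the first five rows shown by `str()`, so its coefficients are not the ones in the table and are far too noisy to interpret; it only illustrates the call.

```r
# First five rows of three of the features, copied from the str() output.
wines_head <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9)
)

# Select the columns of interest, then compute pairwise Pearson correlations.
features <- c("residual.sugar", "density", "alcohol")
cor_matrix <- cor(wines_head[, features])

print(round(cor_matrix, 2))
```

Even on five rows the signs agree with the full matrix: residual sugar and density move together, and alcohol moves against both.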
The main objective is to know how these features affect the wine quality, but first let’s see how other features are related.
There is a very strong relationship between the residual sugar and the density of the wine. In fact the correlation is 0.84.
We also see a strong relationship between density and alcohol, as the correlation of -0.78 suggested.
No relationship can be seen between total acidity and quality; in fact the correlation for this pair is -0.13. The linear model for these features (blue line) is almost a horizontal line. This means the slope (the total acidity coefficient) carries very little weight in this equation.
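For a linear model with a single predictor, R-squared is the square of the Pearson correlation, so a correlation of -0.13 leaves very little variance explained. The arithmetic below is a sketch using the correlation and sample size reported in this document; it also derives the F statistic of the one-predictor model from the same numbers.

```r
# Correlation between quality and total.acidity, from the correlation matrix.
r <- -0.13137721
n <- 4898  # number of wines

# With one predictor, R-squared equals r squared.
r_squared <- r^2

# F statistic of the simple regression: 1 and n - 2 degrees of freedom.
f_stat <- r_squared / (1 - r_squared) * (n - 2)

print(round(r_squared, 4))
print(round(f_stat, 1))
```

Less than 2% of the variance in quality is associated with total acidity alone, yet with almost 5000 observations even such a weak slope can be statistically significant.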
## quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.645 7.102 7.935 8.269 9.163 12.410
## --------------------------------------------------------
## quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.570 7.020 7.590 7.815 8.320 11.520
## --------------------------------------------------------
## quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.900 6.970 7.500 7.574 8.120 11.030
## --------------------------------------------------------
## quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.130 6.860 7.370 7.436 7.940 14.960
## --------------------------------------------------------
## quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.730 6.810 7.310 7.323 7.820 9.870
## --------------------------------------------------------
## quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.525 6.810 7.370 7.261 7.760 8.930
## --------------------------------------------------------
## quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.250 7.600 7.850 8.104 8.000 9.820
## quality: 3
## [1] 20
## --------------------------------------------------------
## quality: 4
## [1] 163
## --------------------------------------------------------
## quality: 5
## [1] 1457
## --------------------------------------------------------
## quality: 6
## [1] 2198
## --------------------------------------------------------
## quality: 7
## [1] 880
## --------------------------------------------------------
## quality: 8
## [1] 175
## --------------------------------------------------------
## quality: 9
## [1] 5
## quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.588 4.600 6.392 10.700 16.200
## --------------------------------------------------------
## quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.300 2.500 4.628 7.100 17.550
## --------------------------------------------------------
## quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.800 7.000 7.335 11.500 23.500
## --------------------------------------------------------
## quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.700 5.300 6.442 9.900 65.800
## --------------------------------------------------------
## quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.700 3.650 5.186 7.325 19.250
## --------------------------------------------------------
## quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 2.100 4.300 5.671 8.200 14.800
## --------------------------------------------------------
## quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.60 2.00 2.20 4.12 4.20 10.60
Again, there is no relationship between residual sugar and quality, as expected, and the linear model for these features is almost a horizontal line.
## quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.55 10.45 10.34 11.00 12.60
## --------------------------------------------------------
## quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.40 10.10 10.15 10.75 13.50
## --------------------------------------------------------
## quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 9.200 9.500 9.809 10.300 13.600
## --------------------------------------------------------
## quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
## --------------------------------------------------------
## quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.60 10.60 11.40 11.37 12.30 14.20
## --------------------------------------------------------
## quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.64 12.60 14.00
## --------------------------------------------------------
## quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 12.40 12.50 12.18 12.70 12.90
Here a small relationship can be seen. In this dataset the quality of the wine seems to increase with the alcohol content.
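As a consistency check on the grouped summaries, the per-quality means weighted by the number of wines in each group should reproduce the overall mean alcohol (10.51). The sketch below uses the group means and counts printed in this report; `by_quality` is a hypothetical helper table, and small differences are expected because the group means are rounded.

```r
# Mean alcohol and number of wines per quality level, from the output above.
by_quality <- data.frame(
  quality      = 3:9,
  n            = c(20, 163, 1457, 2198, 880, 175, 5),
  mean_alcohol = c(10.34, 10.15, 9.809, 10.58, 11.37, 11.64, 12.18)
)

# Group means weighted by group size give the overall mean.
overall_mean <- weighted.mean(by_quality$mean_alcohol, w = by_quality$n)
print(overall_mean)

# Difference in mean alcohol between the best and the most common quality level.
gap <- by_quality$mean_alcohol[by_quality$quality == 9] -
  by_quality$mean_alcohol[by_quality$quality == 6]
print(gap)
```

Note that the extreme levels contain very few wines (20 at quality 3 and 5 at quality 9), so their means are much less reliable than those of levels 5 to 7.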
## quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9911 0.9925 0.9944 0.9949 0.9969 1.0000
## --------------------------------------------------------
## quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9892 0.9926 0.9941 0.9943 0.9958 1.0000
## --------------------------------------------------------
## quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9872 0.9933 0.9953 0.9953 0.9972 1.0020
## --------------------------------------------------------
## quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9876 0.9917 0.9937 0.9940 0.9959 1.0390
## --------------------------------------------------------
## quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9906 0.9918 0.9925 0.9937 1.0000
## --------------------------------------------------------
## quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9903 0.9916 0.9922 0.9935 1.0010
## --------------------------------------------------------
## quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9896 0.9898 0.9903 0.9915 0.9906 0.9970
In this case there seems to be a small relationship between density and quality, but the correlation is low (-0.31), so it is not important.
The main relationships in this bivariate analysis involve the alcohol feature. We could see that it has a strong relationship with density (-0.78) and a moderate one with residual sugar (-0.45).
But no remarkable relationship could be found with quality: none of the features analyzed is strongly related to it (the largest correlation, with alcohol, is 0.44). This is something we could expect because it is not that easy to make a good quality wine, is it?
The most interesting relationships involve the density feature. In fact, looking at the correlations between features, density almost always has the highest values.
The strongest relationship is between density and residual sugar, with a correlation of 0.84. Density and alcohol (-0.78) are also strongly correlated.
Here we can see that for higher quality values the density vs alcohol points seem to sit in the top left of the graph, while for lower quality values they fall farther to the right (always following the linear model, which looks similar for every wine quality).
These plots show that the correlation exists for every quality level, and the values seem to move from right to left along the linear model as the quality increases.
Now let’s see how quality behaves against the other features of interest.
These plots show how difficult it is to obtain a good quality wine. Very few relationships could be found. The next section gives some thoughts on why this happens.
##
## Calls:
## m1: lm(formula = quality ~ total.acidity, data = wines)
## m2: lm(formula = quality ~ total.acidity + log10(residual.sugar),
## data = wines)
## m3: lm(formula = quality ~ total.acidity + log10(residual.sugar) +
## alcohol, data = wines)
##
## ==========================================================
## m1 m2 m3
## ----------------------------------------------------------
## (Intercept) 6.856*** 6.899*** 2.730***
## (0.106) (0.107) (0.155)
## total.acidity -0.131*** -0.126*** -0.086***
## (0.014) (0.014) (0.013)
## log10(residual.sugar) -0.119*** 0.288***
## (0.031) (0.031)
## alcohol 0.343***
## (0.010)
## ----------------------------------------------------------
## R-squared 0.0 0.0 0.2
## adj. R-squared 0.0 0.0 0.2
## sigma 0.9 0.9 0.8
## F 86.0 50.3 435.2
## p 0.0 0.0 0.0
## Log-likelihood -6312.0 -6304.8 -5775.5
## Deviance 3774.7 3763.7 3032.2
## AIC 12630.0 12617.6 11561.1
## BIC 12649.4 12643.6 11593.6
## N 4898 4898 4898
## ==========================================================
##
## Call:
## lm(formula = quality ~ total.acidity + log10(residual.sugar) +
## alcohol, data = wines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5799 -0.5287 -0.0104 0.4770 3.2541
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.729844 0.154573 17.661 < 2e-16 ***
## total.acidity -0.086298 0.012768 -6.759 1.55e-11 ***
## log10(residual.sugar) 0.288359 0.030592 9.426 < 2e-16 ***
## alcohol 0.343059 0.009984 34.361 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7871 on 4894 degrees of freedom
## Multiple R-squared: 0.2106, Adjusted R-squared: 0.2101
## F-statistic: 435.2 on 3 and 4894 DF, p-value: < 2.2e-16
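To make the m3 coefficients concrete, the sketch below plugs the first wine from the `str()` output into the fitted equation. It assumes `total.acidity` is the sum of fixed, volatile and citric acidity (inferred from the summaries, not stated as code in this report), and `predict_quality` is a hypothetical helper, not part of the original analysis.

```r
# Coefficients of m3, copied from the summary() output above.
coefs <- c(
  intercept            = 2.729844,
  total.acidity        = -0.086298,
  log10.residual.sugar = 0.288359,
  alcohol              = 0.343059
)

# Hypothetical helper: evaluates the fitted m3 equation for one wine.
predict_quality <- function(total_acidity, residual_sugar, alcohol) {
  coefs[["intercept"]] +
    coefs[["total.acidity"]] * total_acidity +
    coefs[["log10.residual.sugar"]] * log10(residual_sugar) +
    coefs[["alcohol"]] * alcohol
}

# First wine: fixed 7, volatile 0.27, citric 0.36, sugar 20.7, alcohol 8.8.
first_wine <- predict_quality(
  total_acidity  = 7 + 0.27 + 0.36,  # assumed definition of total.acidity
  residual_sugar = 20.7,
  alcohol        = 8.8
)
print(round(first_wine, 2))
```

The observed quality of that wine is 6, so the prediction is off by about half a point, which is in line with the residual standard error of 0.7871.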
As we saw in the bivariate section, density is highly correlated with residual sugar and alcohol, and as we can see this holds for every wine quality.
Furthermore, a small relationship appears when combining total acidity with residual sugar and alcohol. The linear model has an R-squared of about 0.21, which means that about 21% of the variance in quality is accounted for by the model.
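The adjusted R-squared and the F statistic in the model summary follow directly from R-squared, the number of observations and the number of predictors. The sketch below reproduces them from the reported values using the standard formulas.

```r
r_squared <- 0.2106  # multiple R-squared of m3
n <- 4898            # number of wines
p <- 3               # predictors: total.acidity, log10(residual.sugar), alcohol

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# F statistic: explained variance per predictor over residual variance
# per residual degree of freedom (p and n - p - 1 degrees of freedom).
f_stat <- (r_squared / p) / ((1 - r_squared) / (n - p - 1))

print(round(adj_r_squared, 4))
print(round(f_stat, 1))
```

With so many observations and only three predictors, the adjustment barely changes R-squared, and the very large F statistic reflects the sample size rather than a strong fit.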
As said before, the most interesting feature is density, analyzed together with alcohol and residual sugar. No special interaction could be seen in this section.
In order to predict the wine quality I created a linear model to test the idea that the balance of alcohol + sugar and total acidity gives a good wine quality. The model doesn’t seem to be very accurate (R-squared of 0.2), even though the features used are statistically significant in the model (3 stars in the m3 column).
The distribution of residual sugar appears to be bimodal. This is not easy to explain; maybe it reflects a demand for wines with clearly differentiated levels of sweetness.
Residual sugar is one of the most interesting features because of its high correlation with others. In this case the relationship with density is almost linear.
Quality levels have a small (lower than expected) relationship with total acidity and the alcohol + residual sugar combination. Higher quality wines seem to have less acidity, with higher alcohol and residual sugar values.
The white wines data set contains information on almost 5000 wines. First of all, an exploratory data analysis was performed to understand the features, along with some internet research to contextualize and learn about the topic. This gave me some references about how quality could be calculated or predicted from some of the features already provided in the dataset. Before that, some relationships caught my attention, like the strong relationship of density with other features such as alcohol and residual sugar. Finally, trying to find any relationship that determines a good quality was quite frustrating. Some internet research pointed me to this formula: Sweet Taste (sugars + alcohols) <=> Acid Taste (acids). But in the end it wasn’t as easy as it seemed. I could find a small relationship between these features, but the resulting linear model accounts for only a small share of the variance in quality (about 20%).
One conclusion I can draw is that the data set lacks a wider spread of quality values. Almost all the wines are ‘NORMAL’ and they are difficult to cluster. I also think that my analysis was a bit biased by trying to predict the quality from the previous formula.
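The expectation was that the higher the sugar, the higher the alcohol, but in this data set residual sugar and alcohol are negatively correlated (-0.45). Sweet Taste (sugars + alcohols) <=> Acid Taste (acids) + Bitter Taste (phenols; tannins matter only in red wines). Increasing the fixed acidity decreases the pH (correlation of -0.43). Increasing the citric acid is also associated with a lower pH (correlation of -0.16), not a higher one.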